Probabilistic Protein Function Prediction from Heterogeneous Genome-Wide Data
نویسندگان
چکیده
Dramatic improvements in high throughput sequencing technologies have led to a staggering growth in the number of predicted genes. However, a large fraction of these newly discovered genes do not have a functional assignment. Fortunately, a variety of novel high-throughput genome-wide functional screening technologies provide important clues that shed light on gene function. The integration of heterogeneous data to predict protein function has been shown to improve the accuracy of automated gene annotation systems. In this paper, we propose and evaluate a probabilistic approach for protein function prediction that integrates protein-protein interaction (PPI) data, gene expression data, protein motif information, mutant phenotype data, and protein localization data. First, functional linkage graphs are constructed from PPI data and gene expression data, in which an edge between nodes (proteins) represents evidence for functional similarity. The assumption here is that graph neighbors are more likely to share protein function, compared to proteins that are not neighbors. The functional linkage graph model is then used in concert with protein domain, mutant phenotype and protein localization data to produce a functional prediction. Our method is applied to the functional prediction of Saccharomyces cerevisiae genes, using Gene Ontology (GO) terms as the basis of our annotation. In a cross validation study we show that the integrated model increases recall by 18%, compared to using PPI data alone at the 50% precision. We also show that the integrated predictor is significantly better than each individual predictor. However, the observed improvement vs. PPI depends on both the new source of data and the functional category to be predicted. Surprisingly, in some contexts integration hurts overall prediction accuracy. Lastly, we provide a comprehensive assignment of putative GO terms to 463 proteins that currently have no assigned function.
منابع مشابه
Multi-Instance Metric Transfer Learning for Genome-Wide Protein Function Prediction
Multi-Instance (MI) learning has been proven to be effective for the genome-wide protein function prediction problems where each training example is associated with multiple instances. Many studies in this literature attempted to find an appropriate Multi-Instance Learning (MIL) method for genome-wide protein function prediction under a usual assumption, the underlying distribution from testing...
متن کاملBayesian Data Fusion with Gaussian Process Priors : An Application to Protein Fold Recognition
Various emerging quantitative measurement technologies are producing genome, transcriptome and proteome-wide data collections which has motivated the development of data integration methods within an inferential framework. It has been demonstrated that for certain prediction tasks within computational biology synergistic improvements in performance can be obtained via integration of a number of...
متن کاملA Survey of Computational Methods for Protein Function Prediction
Rapid advances in high-throughout genome sequencing technologies have resulted in millions of protein-encoding gene sequences with no functional characterization. Automated protein function annotation or prediction is a prime problem for computational methods to tackle in the post-genomic era of big molecular data. While recent community-driven experiments demonstrate that the accuracy of funct...
متن کاملPrediction of Protein-protein Interactions on the Basis of Evolutionary Conservation of Protein Functions
MOTIVATION Although a great deal of progress is being made in the development of fast and reliable experimental techniques to extract genome-wide networks of protein-protein and protein-DNA interactions, the sequencing of new genomes proceeds at an even faster rate. That is why there is a considerable need for reliable methods of in-silico prediction of protein interaction based solely on seque...
متن کاملContext specific protein function prediction.
Although whole-genome sequencing of many organisms has been completed, numerous newly discovered genes are still functionally unknown. Using high-throughput data such as protein-protein interaction (PPI) information to assign putative protein function to the unknown genes has been proposed, since in many cases it is not feasible to annotate the newly discovered genes by sequence-based approache...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- PLoS ONE
دوره 2 شماره
صفحات -
تاریخ انتشار 2007